Abstract

Stochastic chains with memory of variable length constitute an interesting family of
stochastic chains of infinite order on a finite alphabet. The idea is that for each past, only a
finite suffix of the past, called context, is enough to predict the next symbol. These models
were first introduced in the information theory literature by Rissanen (1983) as a universal
tool to perform data compression. Recently, they have been used to model scientific
data in areas as different as biology, linguistics and music. This paper presents a personal
introductory guide to this class of models focusing on the algorithm Context and its rate of
convergence.

1 Introduction

Chains with memory of variable length appear in Rissanen's 1983 paper called A universal data
compression system. His idea was to model a string of symbols as a realization of a stochastic
chain where the length of the memory needed to predict the next symbol is not fixed, but is a
deterministic function of the string of the past symbols.

Considering a memory of variable length is a practical way to overcome the well known
difficulty of the exponentially growing number of parameters which are needed to describe a
Markov chain when its order increases. Indeed, if one wants to fit complex data accurately
using a Markov chain of fixed order, one has to use a very high order. This means that
to estimate the parameters of the model we need huge samples, which makes this approach
unsuitable for many practical issues.

It turns out that in many important scientific data sets, the length of the relevant portion of the
past is not fixed; on the contrary, it depends on the past. For instance, in molecular biology, the
translation of a gene into a protein is initiated by a fixed specific sequence of nucleotide bases
called start codon. In other words, the start codon marks the end of the relevant portion of the
past to be considered in the translation.

The same phenomenon appears in other scientific domains. For instance in linguistics, both
in phonology and in syntax, there is the notion of domains in which the grammar operates to
define admissible strings of forthcoming symbols. In other terms, the boundary of the linguistic
domain defines the relevant part of the past for the processing of the next linguistic units.
Rissanen's ingenious idea was to construct a stochastic model that generalizes this notion of
relevant domain to any kind of symbolic string.

To be more precise, Rissanen (1983) called context the relevant part of the past. The stochas- 
tic model is defined by the set of all contexts and an associated family of transition probabilities. 

Models with memory of variable length are not only less expensive than the classical fixed 
order Markov chains, but also much more clever since they take into account the structural 
dependencies present in the data. This is precisely what the set of contexts expresses. 

Rissanen has introduced models having memory of variable length as a universal system of 
data compression. His goal was to compress in real time a string of symbols generated by an 
unknown source. To do this, we have to estimate at each step the length of the context of the 
string observed until that time step, as well as the associated transition probabilities. 

If we knew the contexts, then the estimation of the associated transition probabilities could
be done using a classical procedure such as maximum likelihood estimation. Therefore, the main
point is to determine the context length. In his seminal 1983 paper, Rissanen solved this
problem by introducing the algorithm Context. This algorithm estimates in a consistent way
both the length of the context and the associated transition probability.

The class of models with memory of variable length raises interesting questions from the 
point of view of statistics. Examples are the rate of convergence and the fluctuations of the 
algorithm Context and other estimators of the model. Another challenging question would be 
how to produce a robust version of the algorithm Context. 

This class of models is also interesting from the point of view of probability theory. Indeed,
if the length of the contexts is not bounded, then chains with memory of variable length
are chains of infinite order. Existence, uniqueness, phase transitions and perfect simulation are
deep mathematical questions that should be addressed in this new and challenging class of models.

Last but not least, models of variable length have proved to be very effective tools in applied
statistics, efficiently achieving classification tasks in proteomics, genomics, linguistics,
classification of musical styles, and much more.

In what follows we present a personal introductory guide to this class of models with no 
attempt to give a complete survey of the subject. We will mainly focus on the algorithm 
Context and present some recent results obtained by our research team. 

2 Probabilistic context trees 

In what follows $A$ will represent a finite alphabet of size $|A|$. Given two integers $m \le n$, we
will denote by $x_m^n$ the sequence $(x_m, \ldots, x_n)$ of symbols in $A$. Let $A^*_+$ be the set of all finite
sequences, that is

$$
A^*_+ = \bigcup_{k=1}^{\infty} A^{\{-k, \ldots, -1\}}.
$$

We shall write $\underline{A} = A^{\{\ldots, -n, \ldots, -2, -1\}}$, and denote by $x_{-\infty}^{-1}$ any element of $\underline{A}$.
Our main object of interest is what we shall call context length function.

Definition 2.1 A context length function $\ell$ is a function $\ell : A^*_+ \to \{1, 2, \ldots\} \cup \{\infty\}$ satisfying
the following two properties.
(i) For any $k \ge 1$, for any $x_{-k}^{-1} \in A^*_+$, we have

$$
\ell(x_{-k}^{-1}) \in \{1, \ldots, k\} \cup \{+\infty\}.
$$

(ii) For any $x_{-\infty}^{-1} \in \underline{A}$, if $\ell(x_{-k}^{-1}) = k$ for some $k \ge 1$, then

$$
\ell(x_{-i}^{-1}) = \infty \ \text{ for any } i < k, \qquad \ell(x_{-i}^{-1}) = k \ \text{ for any } i > k.
$$

Intuitively, given a sequence $x_{-\infty}^{-1}$, the function $\ell$ tells us at which position in the past we
can stop since we have reached the end of the context. The first condition is a kind of adaptivity
condition. It tells us that we can decide whether the end of the context has already been reached
at step $k$ just by inspecting the past sequence up to that step. If $\ell$ equals $+\infty$, we have to look
further back in the past. The second condition is a consistency condition. It tells us that once
we have reached the bound of the context, we do not have to look further back in the past, and
that the context of a longer sequence $x_{-i}^{-1}$, $i > k$, is also the context of $x_{-k}^{-1}$. In other terms, once
the identification of the context is made at a given step $k$, this decision will not be changed by
any further data present in the past before $k$.

By abuse of notation, we shall also call $\ell$ the natural extension of the context length function
to $\underline{A}$, given by

$$
\ell(x_{-\infty}^{-1}) = \inf\{k \ge 1 : \ell(x_{-k}^{-1}) < +\infty\},
$$

with the convention that $\inf \emptyset = +\infty$.
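
The extension of the context length function to longer pasts can be illustrated by a minimal Python sketch. It scans a finite past from the most recent symbol backwards and stops at the first suffix recognised as a context. The names `context_length` and `ends_at_last_one`, the list encoding (oldest symbol first, $x_{-1}$ last) and the rule "the context ends at the most recent symbol 1" are assumptions made for illustration only.

```python
from typing import Callable, Optional, Sequence


def context_length(past: Sequence[int],
                   is_context: Callable[[Sequence[int]], bool]) -> Optional[int]:
    """Smallest k >= 1 such that the last k symbols of `past` form a context.

    Returns None when no suffix of the observed past is a context, which
    plays the role of the value +infinity in the text.
    """
    for k in range(1, len(past) + 1):
        if is_context(past[-k:]):  # past[-k:] is (x_{-k}, ..., x_{-1})
            return k
    return None


def ends_at_last_one(suffix: Sequence[int]) -> bool:
    """Hypothetical rule: a suffix is a context if its oldest symbol is 1
    and every more recent symbol is 0."""
    return suffix[0] == 1 and all(s == 0 for s in suffix[1:])


past = [0, 1, 1, 0, 0]                       # x_{-5}, ..., x_{-1}
k = context_length(past, ends_at_last_one)   # the context is past[-k:]
```

Because the scan stops at the first admissible suffix, symbols further back in the past never change the answer, which is the content of the consistency condition (ii).
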

Definition 2.2 For any $x_{-\infty}^{-1} \in \underline{A}$, we shall call $x_{-\ell(x_{-\infty}^{-1})}^{-1}$ the context associated to $\ell$ of the
infinite sequence $x_{-\infty}^{-1}$.

Definition 2.3 Let $\ell$ be a given context length function. A stationary stochastic chain
$(X_n)_{n \in \mathbb{Z}}$ taking values in $A$ is a chain having memory of variable length, if for any infinite
past $x_{-\infty}^{-1} \in \underline{A}$ and any symbol $a \in A$, we have

$$
P\big(X_0 = a \mid X_{-\infty}^{-1} = x_{-\infty}^{-1}\big) = P\big(X_0 = a \mid X_{-\ell(x_{-\infty}^{-1})}^{-1} = x_{-\ell(x_{-\infty}^{-1})}^{-1}\big). \tag{2.1}
$$

We shall use the short hand notation

$$
p(a \mid x_{-k}^{-1}) = P\big(X_0 = a \mid X_{-k}^{-1} = x_{-k}^{-1}\big). \tag{2.2}
$$

We are mainly interested in those values of $p(a \mid x_{-k}^{-1})$ where $k = \ell(x_{-\infty}^{-1})$.

Observe that the set $\{\ell(X_{-\infty}^{-1}) = k\}$ is measurable with respect to the $\sigma$-algebra generated
by $X_{-k}^{-1}$. Thus we have

Proposition 2.4 Let $(X_n)$ be a stationary chain as in Definition 2.3 having context length
function $\ell$. Put $\mathcal{F}_k = \sigma\{X_{-k}, \ldots, X_{-1}\}$, $k \ge 1$. Then $\ell(X_{-\infty}^{-1})$ is a $(\mathcal{F}_k)_k$-stopping time.

Given a context length function $\ell$, we define an associated countable subset $\tau \subset A^*_+$ by

$$
\tau = \tau^{\ell} = \{x_{-k}^{-1} : k = \ell(x_{-k}^{-1}),\ k \ge 1\}.
$$

To simplify notation, we will denote by $\underline{x}$ and $\underline{y}$ generic elements of $\tau$.

Definition 2.5 Given a finite sequence $x_{-k}^{-1}$, we shall call suffix of $x_{-k}^{-1}$ each string $x_{-j}^{-1}$ with
$j \le k$. If $j < k$ we call $x_{-j}^{-1}$ proper suffix of $x_{-k}^{-1}$. Now let $S \subset A^*_+$. We say that $S$ satisfies the
suffix property if for no $x_{-k}^{-1} \in S$, there exists a proper suffix $x_{-j}^{-1} \in S$ of $x_{-k}^{-1}$.



The following proposition follows immediately from property (ii) of Definition 2.1.

Proposition 2.6 Given a context length function $\ell$, the associated set $\tau^{\ell}$ satisfies the suffix
property.
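
For a finite set of strings, the suffix property can be checked directly. The sketch below is only an illustration: the function name `satisfies_suffix_property` is hypothetical, and strings are written as Python `str` values with the oldest symbol first, so that the proper suffixes of `w` are `w[-j:]` for `j < len(w)`.

```python
def satisfies_suffix_property(strings):
    """Return True if no element of `strings` has a proper suffix in `strings`."""
    s = set(strings)
    for w in s:
        # proper suffixes keep the j most recent symbols, 1 <= j < len(w)
        for j in range(1, len(w)):
            if w[-j:] in s:
                return False
    return True


ok = satisfies_suffix_property({"1", "10", "100"})   # no string ends with another one
bad = satisfies_suffix_property({"0", "10"})         # "0" is a proper suffix of "10"
```
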

As a consequence, $\tau$ can be identified with the set of leaves of a rooted tree with a countable
set of finite labeled branches.

Definition 2.7 We call probabilistic context tree on $A$ the ordered pair $(\tau, p)$, where

$$
p = \{p(\cdot \mid \underline{x}),\ \underline{x} \in \tau\}
$$

is the family of transition probabilities of (2.2). We say that the probabilistic context tree $(\tau, p)$
is unbounded if the function $\ell$ is unbounded.

Definition 2.8 Let $(X_n)_{n \in \mathbb{Z}}$ be a stationary chain and let $(\tau, p)$ be a probabilistic context
tree. We shall say that $(X_n)_{n \in \mathbb{Z}}$ is compatible with $(\tau, p)$, if (2.2) holds for all $\underline{x} \in \tau$.

In order to illustrate these mathematical concepts, let us consider the following example. 

Example 2.9 Consider a two-symbol alphabet $A = \{0, 1\}$ and the following context length
function

$$
\ell(x_{-\infty}^{-1}) = \inf\{k : x_{-k} = 1\}.
$$

Then the associated tree $\tau$ is given by

$$
\tau = \{10^k,\ k \ge 0\},
$$

where $10^k$ represents the sequence $(x_{-k-1}, x_{-k}, \ldots, x_{-1})$ such that $x_{-i} = 0$ for all $1 \le i \le k$ and
$x_{-k-1} = 1$.

The associated transition probabilities are defined by

$$
P(X_0 = 1 \mid X_{-1} = \ldots = X_{-k} = 0,\ X_{-k-1} = 1) = q_k, \quad k \ge 0,
$$

with $0 < q_k < 1$.

Clearly, the stochastic chain associated to this context length function $\ell$ is a chain of infinite
order. This raises the mathematical question of existence of such a process. It is straightforward
to see that the following proposition holds true.

Proposition 2.10 Suppose that $X_0 = 1$. Put $T_1 = \inf\{k \ge 1 : X_k = 1\}$. A necessary and
sufficient condition for $T_1 < +\infty$ almost surely is

$$
\sum_{k \ge 0} q_k = +\infty. \tag{2.3}
$$

This means that if (2.3) is satisfied then, provided the chain starts from 1, almost surely
the symbol 1 will appear infinitely often. This implies that there exists a
non-trivial stationary chain associated to this probabilistic context tree.
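
The role of the divergence condition can be checked numerically. Starting from $X_0 = 1$, the chain stays at 0 up to time $n$ with probability $\prod_{k=0}^{n-1}(1-q_k)$, since after $k$ zeros the active context is $10^k$. The sketch below evaluates this product for two hypothetical choices of $q_k$, one with a divergent sum and one with a convergent sum; the function names are assumptions made for illustration.

```python
def prob_no_one_up_to(q, n):
    """P(T_1 > n) given X_0 = 1: product of (1 - q(k)) for k = 0, ..., n-1."""
    prob = 1.0
    for k in range(n):
        prob *= 1.0 - q(k)
    return prob


def q_divergent(k):
    """q_k = 1/(k+2): the series sum of q_k diverges."""
    return 1.0 / (k + 2)


def q_summable(k):
    """q_k = 1/(k+2)^2: the series sum of q_k converges."""
    return 1.0 / (k + 2) ** 2


for n in (10, 1000, 100000):
    print(n, prob_no_one_up_to(q_divergent, n), prob_no_one_up_to(q_summable, n))
```

In the first case the product telescopes to $1/(n+1)$ and tends to 0, so the symbol 1 returns almost surely. In the second case it equals $(n+2)/(2(n+1))$ and tends to $1/2$, so with positive probability the symbol 1 never reappears.
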

Observe that the process $X_n$ is actually a renewal process with the renewal times defined as
follows.

$$
T_0 = \sup\{n \le 0 : X_n = 1\},
$$

and for $k \ge 1$,

$$
T_k = \inf\{n > T_{k-1} : X_n = 1\} \quad \text{and} \quad T_{-k} = \sup\{n < T_{-(k-1)} : X_n = 1\}.
$$

This example shows clearly that the tree of contexts defines a partition of all possible pasts
with the exception of the single string composed of all symbols identical to 0. The condition
(2.3) shows that it is possible to construct the chain taking values in the set of all sequences
having an infinite number of symbols 1 both to the left and to the right of the origin.

However, we could also include this exceptional string in the set of possible contexts by
defining an extra parameter $q_\infty$. This is the choice of Csiszar and Talata (2006). If $q_\infty$ is strictly
positive, then condition (2.3) implies that after a finite time the symbol 1 will appear,
even if we start with an infinity of symbols 0. In other terms, this exceptional string does not
have to be considered if we are interested in the stationary regime of the chain.

In case that $q_\infty = 0$ and (2.3) holds, we have the phenomenon of phase transition. One of
the phases is composed of only one string having only the symbol 0.

The renewal process is an interesting example of a chain having memory of unbounded 
variable length. In the case where the probabilistic context tree is bounded, the corresponding 
chain is in fact a Markov chain whose order is equal to the maximal context length. However, the 
tree of contexts provides interesting additional information concerning the dependencies in the 
data and the structure of the chain. This raises the issue of how to estimate the context tree
from the data. This was originally solved in Rissanen's 1983 paper using the algorithm Context.

At this point it is important to discuss the following minimality issue. Among all possible 
context trees fitting the data, we want of course to identify the smallest one. This is the tree 
corresponding to the smallest context length function. More precisely, if $\ell$ and $\ell'$ are context
length functions, we shall say that $\ell \le \ell'$ if $\ell(x_{-\infty}^{-1}) \le \ell'(x_{-\infty}^{-1})$ for any string $x_{-\infty}^{-1} \in \underline{A}$. From now
on we shall call context of a string $x_{-\infty}^{-1}$ the context associated to the minimal context length
function. Estimating this minimal context is precisely the goal of the algorithm Context.

3 The algorithm Context 

We now present the algorithm Context introduced by Rissanen (1983). The goal of the algorithm
is to estimate adaptively the context of the next symbol $X_n$ given the past symbols $X_0^{n-1}$. The
way the algorithm Context works can be summarized as follows. Given a sample produced by a
chain with variable memory, we start with a maximal tree of candidate contexts for the sample.
The branches of this first tree are then pruned starting from the leaves towards the root until
we obtain a minimal tree of contexts well adapted to the sample. We associate to each context
an estimated transition probability defined as the proportion of times the context appears in the
sample followed by each one of the symbols in the alphabet. We stop pruning once the gain
function exceeds a given threshold.

Let $X_0, X_1, \ldots, X_{n-1}$ be a sample from the finite probabilistic tree $(\tau, p)$. For any finite
string $x_{-j}^{-1}$ with $j \le n$, we denote by $N_n(x_{-j}^{-1})$ the number of occurrences of the string in the sample

$$
N_n(x_{-j}^{-1}) = \sum_{t=0}^{n-j} \mathbf{1}\{X_t^{t+j-1} = x_{-j}^{-1}\}. \tag{3.4}
$$
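
The counting function can be sketched in a few lines of Python. This is an illustration only: the function name `count_occurrences` is hypothetical, the sample and the string are encoded as Python `str` values with the oldest symbol first, and overlapping occurrences are counted, as in the sum over all starting positions $t$.

```python
def count_occurrences(sample: str, w: str) -> int:
    """Number of positions t in 0..n-j with sample[t:t+j] == w, where j = len(w)."""
    j = len(w)
    return sum(1 for t in range(len(sample) - j + 1) if sample[t:t + j] == w)


sample = "0100100"
n_10 = count_occurrences(sample, "10")      # occurrences start at t = 1 and t = 4
n_00 = count_occurrences("000", "00")       # overlapping occurrences are both counted
```
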



Rissanen first constructs a maximal candidate context $X_{n-M(n)}^{n-1}$, where $M(n)$ is a random length
defined as follows

$$
M(n) = \min\left\{i = 0, 1, \ldots, \lfloor C_1 \log n \rfloor : N_n(X_{n-i}^{n-1}) > \frac{C_2\, n}{\sqrt{\log n}}\right\}. \tag{3.5}
$$

Here $C_1$ and $C_2$ are arbitrary positive constants. In the case the set is empty we take $M(n) = 0$.

Rissanen then shortens this maximal candidate context by successively pruning the branches 
according to a sequence of tests based on the likelihood ratio statistics. This is formally done 
as follows. 

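If $\sum_{b \in A} N_n(x_{-j}^{-1} b) > 0$, define the estimator of the transition probability $p$ by

$$
\hat p_n(a \mid x_{-j}^{-1}) = \frac{N_n(x_{-j}^{-1} a)}{\sum_{b \in A} N_n(x_{-j}^{-1} b)}, \tag{3.6}
$$

where $x_{-j}^{-1} a$ denotes the string $(x_{-j}, \ldots, x_{-1}, a)$, obtained by concatenating $x_{-j}^{-1}$ and the symbol
$a$. If $\sum_{b \in A} N_n(x_{-j}^{-1} b) = 0$, define $\hat p_n(a \mid x_{-j}^{-1}) = 1/|A|$.
For $i \ge 1$ we define

$$
\Lambda_n(x_{-i}^{-1}) = 2 \sum_{y \in A} \sum_{a \in A} N_n(y x_{-i}^{-1} a) \log\left[\frac{\hat p_n(a \mid x_{-i}^{-1} y)}{\hat p_n(a \mid x_{-i}^{-1})}\right], \tag{3.7}
$$

where $y x_{-i}^{-1}$ denotes the string $(y, x_{-i}, \ldots, x_{-1})$, and where

$$
\hat p_n(a \mid x_{-i}^{-1} y) = \frac{N_n(y x_{-i}^{-1} a)}{\sum_{b \in A} N_n(y x_{-i}^{-1} b)}.
$$

Notice that $\Lambda_n(x_{-i}^{-1})$ is the log-likelihood ratio statistic for testing the consistency of the
sample with a probabilistic suffix tree $(\tau, p)$ against the alternative that it is consistent with
$(\tau', p')$ where $\tau$ and $\tau'$ differ only by one set of sibling nodes branching from $x_{-i}^{-1}$. $\Lambda_n(x_{-i}^{-1})$ plays
the role of a gain function telling us whether or not it is worth taking a further step back
in the past.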
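
The empirical transition probabilities and the likelihood-ratio gain can be sketched as follows. This is a hedged illustration rather than Rissanen's implementation: the function names are hypothetical, strings are Python `str` values with the oldest symbol first, the factor 2 follows the usual likelihood-ratio convention, and terms with a zero count are skipped (the convention $0 \log 0 = 0$).

```python
import math


def count(sample: str, w: str) -> int:
    """N_n(w): number of (possibly overlapping) occurrences of w in the sample."""
    j = len(w)
    return sum(1 for t in range(len(sample) - j + 1) if sample[t:t + j] == w)


def p_hat(sample: str, alphabet: str, w: str, a: str) -> float:
    """Empirical probability of symbol a after the string w (uniform if w is never followed)."""
    total = sum(count(sample, w + b) for b in alphabet)
    if total == 0:
        return 1.0 / len(alphabet)
    return count(sample, w + a) / total


def gain(sample: str, alphabet: str, w: str) -> float:
    """Log-likelihood ratio comparing the context w with its one-symbol extensions y + w."""
    lam = 0.0
    for y in alphabet:
        for a in alphabet:
            n_ywa = count(sample, y + w + a)
            if n_ywa > 0:  # count(w + a) >= n_ywa > 0, so the ratio is well defined
                ratio = p_hat(sample, alphabet, y + w, a) / p_hat(sample, alphabet, w, a)
                lam += n_ywa * math.log(ratio)
    return 2.0 * lam


alternating = "01" * 5           # the last symbol already determines the next one
periodic = "0011" * 3            # two symbols are needed to predict the next one
g_alt = gain(alternating, "01", "1")
g_per = gain(periodic, "01", "0")
```

For the alternating sample, one more symbol of the past adds no information and the gain is zero. For the periodic sample, extending the context `"0"` by one symbol makes the prediction deterministic, and the gain is strictly positive.
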

Rissanen then defines the length of the estimated current context $\hat\ell_n$ as

$$
\hat\ell_n(X_0^{n-1}) = 1 + \max\left\{i = 1, \ldots, M(n) - 1 : \Lambda_n(X_{n-i}^{n-1}) > C_2 \log n\right\}, \tag{3.8}
$$

where $C_2$ is any positive constant.

Then, the result in Rissanen (1983) is the following.

Theorem 3.1 Given a realization $X_0, \ldots, X_{n-1}$ of a probabilistic suffix tree $(\tau, p)$ with finite
height, then

$$
P\big(\hat\ell_n(X_0^{n-1}) \ne \ell(X_0^{n-1})\big) \to 0 \tag{3.9}
$$

as $n \to \infty$.

Rissanen proves this result in a very short and elegant way. His starting point is the following
upper bound.

$$
P\big(\hat\ell_n(X_0^{n-1}) \ne \ell(X_0^{n-1})\big) \le P\left(\hat\ell_n(X_0^{n-1}) \ne \ell(X_0^{n-1}) \,\Big|\, N_n\big(X_{n-\ell(X_0^{n-1})}^{n-1}\big) > \frac{C_2\, n}{\sqrt{\log n}}\right) + \sum_{\underline{x} \in \tau} P\left(N_n(\underline{x}) \le \frac{C_2\, n}{\sqrt{\log n}}\right). \tag{3.10}
$$

Then he provides the following explicit upper bound for the conditional probability in the
right-hand side of (3.10)

$$
P\left(\hat\ell_n(X_0^{n-1}) \ne \ell(X_0^{n-1}) \,\Big|\, N_n\big(X_{n-\ell(X_0^{n-1})}^{n-1}\big) > \frac{C_2\, n}{\sqrt{\log n}}\right) \le C_1 \log n \; e^{-C_2' \log n}, \tag{3.11}
$$

where $C_1$, $C_2$ and $C_2'$ are positive constants independent of the maximum of the context length
function.

With respect to the second term he only observes that, by ergodicity, for each $x_{-k}^{-1} \in \tau$ we
have

$$
P\left(N_n(x_{-k}^{-1}) \le \frac{C_2\, n}{\sqrt{\log n}}\right) \to 0 \tag{3.12}
$$

as $n \to \infty$. Since $\tau$ is finite the convergence in (3.12) implies the desired result.

4 The unbounded case 

In his original paper, Rissanen was only interested in the case of bounded context trees. However, 
from the mathematical point of view, it is interesting to consider also the case of unbounded 
probabilistic context trees corresponding to chains of infinite order. It can be argued that
the unbounded case must also be considered from an applied point of view, since noisy observations
of Markov chains generically have infinite order memory.

The unbounded case raises immediately the preliminary question of existence and uniqueness 
of the corresponding chain. This issue can be addressed by adapting to probabilistic context 
trees the conditions for existence and uniqueness that have already been proved for infinite order 
chains. This is precisely what is done in the paper by Duarte et al. (2006) who adapt the type 
A condition presented in Fernandez and Galves (2002) in the following way. 

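To simplify the presentation, let us introduce some extra notation. Recall that $\underline{x}$ and $\underline{y}$
denote generic elements of $\tau$. Given $\underline{x} = x_{-i}^{-1}$ and $\underline{y} = y_{-j}^{-1}$, we shall write $\underline{x} \overset{k}{=} \underline{y}$ if and only if
$k \le \min\{i, j\}$ and $x_{-1} = y_{-1}, \ldots, x_{-k} = y_{-k}$.

Definition 4.1 A probabilistic suffix tree $(\tau, p)$ on $A$ is of type A if its transition probabilities
$p$ satisfy the following conditions.

1. Weak non-nullness, that is

$$
\sum_{a \in A} \inf_{\underline{x} \in \tau} p(a \mid \underline{x}) > 0; \tag{4.13}
$$

2. Continuity, that is

$$
\beta(k) = \max_{a \in A} \sup\left\{|p(a \mid \underline{x}) - p(a \mid \underline{y})| : \underline{y} \in \tau,\ \underline{x} \in \tau \text{ with } \underline{x} \overset{k}{=} \underline{y}\right\} \to 0 \tag{4.14}
$$

as $k \to \infty$. We also define

$$
\beta(0) = \max_{a \in A} \sup\left\{|p(a \mid \underline{x}) - p(a \mid \underline{y})| : \underline{y} \in \tau,\ \underline{x} \in \tau \text{ with } x_{-1} \ne y_{-1}\right\}.
$$

The sequence $\{\beta(k)\}_{k \in \mathbb{N}}$ is called the continuity rate.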



For a probabilistic suffix tree of type A with summable continuity rate, the maximal coupling 
argument used in Fernandez and Galves (2002) implies the uniqueness of the law of the chain 
consistent with it. 

We now present a slightly different version of the algorithm Context using the same gain
function $\Lambda_n$ but in which the length of the maximum context candidate is now deterministic and
no longer random. More precisely, we define the length of the biggest candidate context now as

$$
k(n) = C_1 \log n \tag{4.15}
$$

with a suitable positive constant $C_1$.

The intuitive reason behind the choice of the upper bound length $C_1 \log n$ is the impossibility
of estimating the probability of sequences of length much longer than $\log n$ based on a sample
of length $n$. Recent versions of this fact can be found in Marton and Shields (1994, 1996) and
Csiszar (2002).

Now, the definition of $\hat\ell_n$ is similar to the one in the original algorithm of Rissanen, that is

$$
\hat\ell_n(X_0^{n-1}) = 1 + \max\left\{i = 1, \ldots, k(n) - 1 : \Lambda_n(X_{n-i}^{n-1}) > C_2 \log n\right\}, \tag{4.16}
$$

where $C_2$ is any positive constant.
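
The estimator with a deterministic maximal length can be sketched as below. This is an illustrative sketch under stated assumptions, not the implementation of Duarte et al. (2006): all function names are hypothetical, $k(n)$ is rounded down to an integer, the likelihood-ratio gain carries the conventional factor 2, and when no candidate exceeds the threshold the sketch returns 1, a convention the text does not specify.

```python
import math


def count(sample: str, w: str) -> int:
    """Number of (possibly overlapping) occurrences of w in the sample."""
    j = len(w)
    return sum(1 for t in range(len(sample) - j + 1) if sample[t:t + j] == w)


def p_hat(sample: str, alphabet: str, w: str, a: str) -> float:
    """Empirical probability of a after w, uniform when w is never followed by a symbol."""
    total = sum(count(sample, w + b) for b in alphabet)
    return count(sample, w + a) / total if total > 0 else 1.0 / len(alphabet)


def gain(sample: str, alphabet: str, w: str) -> float:
    """Log-likelihood ratio gain of extending w by one more symbol of the past."""
    lam = 0.0
    for y in alphabet:
        for a in alphabet:
            n_ywa = count(sample, y + w + a)
            if n_ywa > 0:
                lam += n_ywa * math.log(
                    p_hat(sample, alphabet, y + w, a) / p_hat(sample, alphabet, w, a))
    return 2.0 * lam


def estimated_context_length(sample: str, alphabet: str,
                             c1: float = 1.0, c2: float = 1.0) -> int:
    """1 + largest i in 1..k(n)-1 whose suffix of length i has gain above c2 * log n."""
    n = len(sample)
    k_n = int(math.floor(c1 * math.log(n)))   # assumption: k(n) rounded down
    threshold = c2 * math.log(n)
    best = 0                                  # assumption: empty set gives length 1
    for i in range(1, k_n):
        if gain(sample, alphabet, sample[n - i:]) > threshold:
            best = i
    return 1 + best


sample = "0011" * 50     # a deterministic pattern that needs two symbols of memory
length = estimated_context_length(sample, "01")
```

On this periodic sample the estimate is 2: extending the last symbol by one step in the past yields a large gain, while longer extensions yield none.
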

The reason for taking the length of the maximum context candidate deterministic and no
longer random is to be able to use the classical results on the convergence of the law of $\Lambda_n(x_{-i}^{-1})$
to a chi-square distribution. However, we are not in a Markov setup since the probabilistic
context tree is unbounded, and the chi-square approximation only works for Markov chains of
fixed finite order.

To overcome this difficulty, we use the canonical Markov approximation of chains of infinite 
order presented in Fernandez and Galves (2002) that we recall now by adapting the definitions 
and theorem to the framework of probabilistic context trees. The goal is to approximate a chain 
compatible with an unbounded probabilistic context tree by a sequence of chains compatible 
with bounded probabilistic context trees. 

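Definition 4.2 For all $k \ge 1$, the canonical Markov approximation of order $k$ of a chain
$(X_n)_{n \in \mathbb{Z}}$ is the chain with memory of variable length bounded by $k$ compatible with the
probabilistic context tree $(\tau^{[k]}, p^{[k]})$ where

$$
\tau^{[k]} = \{\underline{x} \in \tau;\ \ell(\underline{x}) \le k\} \cup \{x_{-k}^{-1};\ \underline{x} \in \tau,\ \ell(\underline{x}) \ge k\} \tag{4.17}
$$

and where, for all $a \in A$,

$$
p^{[k]}(a \mid x_{-j}^{-1}) := P(X_0 = a \mid X_{-j}^{-1} = x_{-j}^{-1}) \tag{4.18}
$$

for all $x_{-j}^{-1} \in \tau^{[k]}$.

Observe that for contexts $\underline{x} \in \tau$ whose length does not exceed $k$, we have $p^{[k]}(a \mid \underline{x}) = p(a \mid \underline{x})$.
However, for sequences $x_{-k}^{-1}$ which are internal nodes of $\tau$, there is no easy explicit formula
expressing $p^{[k]}(\cdot \mid x_{-k}^{-1})$ in terms of the family $\{p(\cdot \mid \underline{y}),\ \underline{y} \in \tau\}$.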
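
The tree part of the canonical Markov approximation is easy to compute for a finite collection of contexts: short contexts are kept and long ones are cut to their $k$ most recent symbols. The sketch below illustrates this on finitely many contexts of the renewal example. The function name `truncate_tree` and the string encoding (oldest symbol first) are assumptions, and the transition probabilities $p^{[k]}$ are not computed.

```python
def truncate_tree(contexts, k):
    """Keep contexts of length <= k and replace longer ones by their last k symbols."""
    return {w if len(w) <= k else w[-k:] for w in contexts}


# finitely many contexts 1 0^j of the renewal example, j = 0, ..., 3
renewal_contexts = {"1", "10", "100", "1000"}
truncated = truncate_tree(renewal_contexts, 2)
```

Here the contexts `"100"` and `"1000"` are both replaced by the single internal node `"00"`, so the truncated tree has three leaves.
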

The main result of Fernandez and Galves (2002) that will be used in the proof of the con- 
sistency of the algorithm Context can be stated as follows. 



Theorem 4.3 Let $(X_n)_{n \in \mathbb{Z}}$ be a chain compatible with a type A probabilistic context tree
$(\tau, p)$ with summable continuity rate, and let $(X_n^{[k]})_{n \in \mathbb{Z}}$ be its canonical Markov approximation
of order $k$. Then there exists a coupling between $(X_n)_{n \in \mathbb{Z}}$ and $(X_n^{[k]})_{n \in \mathbb{Z}}$ and a constant $C > 0$
such that

$$
P\big(X_0 \ne X_0^{[k]}\big) \le C \beta(k). \tag{4.19}
$$



Using this result and the classical chi-square approximation for Markov chains, Duarte et al. 
(2006) proved the consistency of their version of the algorithm Context in the unbounded case 
and also provided an upper bound for the rate of convergence. Their result is the following. 

Theorem 4.4 Let $X_0, X_1, \ldots, X_{n-1}$ be a sample from a type A unbounded probabilistic suffix
tree $(\tau, p)$ with continuity rate $\beta(j) \le f(j) \exp\{-j\}$, with $f(j) \to 0$ as $j \to \infty$. Then, for any
choice of positive constants $C_1$ and $C_2$ in (4.15) and (4.16), there exist positive constants $C$ and
$D$ such that

$$
P\big(\hat\ell_n(X_0^{n-1}) \ne \ell(X_0^{n-1})\big) \le C_1 \log n \,\big(n^{-C_2} + D/n\big) + C f(C_1 \log n).
$$

The proof can be sketched very easily. Take $k = k(n) = C_1 \log n$ and construct a coupled
version of the processes $(X_t)_{t \in \mathbb{Z}}$ and $(X_t^{[k]})_{t \in \mathbb{Z}}$. First of all notice that for $k = k(n)$,

$$
P\big(\hat\ell_n(X_0, \ldots, X_{n-1}) \ne \ell(X_0, \ldots, X_{n-1})\big) \le P\big(\hat\ell_n(X_0^{[k]}, \ldots, X_{n-1}^{[k]}) \ne \ell(X_0^{[k]}, \ldots, X_{n-1}^{[k]})\big) + P\Big(\bigcup_{t=0}^{n-1} \{X_t \ne X_t^{[k]}\}\Big). \tag{4.20}
$$

Using the inequality (4.19) of Fernandez and Galves (2002), the second term in (4.20) can be
bounded above as

$$
P\Big(\bigcup_{t=0}^{n-1} \{X_t \ne X_t^{[k]}\}\Big) \le n\, C \beta(k(n)).
$$

The first term in (4.20) can be treated using the classical chi-square approximation for the
log-likelihood ratio test for Markov chains of fixed order $k$.

More precisely, we know that for fixed $x_{-i}^{-1}$, under the null hypothesis, the statistic $\Lambda_n(x_{-i}^{-1})$,
given by (3.7), asymptotically has a chi-square distribution with $|A| - 1$ degrees of freedom (see,
for example, van der Vaart (1998)). We recall that, for each $x_{-i}^{-1}$, the null hypothesis $(H_0^i)$ is that
the true context is $x_{-i}^{-1}$.

Since we are going to perform a sequence of $k(n)$ sequential tests where $k(n) \to \infty$ as $n$
diverges, we need to control the error in the chi-square approximation. For this, we use a
well-known asymptotic expansion for the distribution of $\Lambda_n(x_{-i}^{-1})$ due to Hayakawa (1977) which
implies that

$$
P\big(\Lambda_n(x_{-i}^{-1}) \le t \mid H_0^i\big) = P(\chi^2 \le t) + D/n, \tag{4.21}
$$

where $D$ is a positive constant and $\chi^2$ is a random variable with chi-square distribution with
$|A| - 1$ degrees of freedom.

Therefore, it is immediate that

$$
P\big(\Lambda_n(x_{-i}^{-1}) > C_2 \log n\big) \le e^{-C_2 \log n} + D/n.
$$

By the way we defined $\hat\ell_n$ in (4.16), in order to find $\hat\ell_n(X_0^{n-1})$ we have to perform at most
$k(n)$ tests. We want to give an upper bound for the overall probability of type I error in a
sequence of $k(n)$ sequential tests. An upper bound is given by the Bonferroni inequality, which
in our case can be written as

$$
P\Big(\bigcup_{i=2}^{k(n)} \{\Lambda_n(x_{-i}^{-1}) > C_2 \log n\} \,\Big|\, H_0^i\Big) \le \sum_{i=2}^{k(n)} P\big(\Lambda_n(x_{-i}^{-1}) > C_2 \log n \mid H_0^i\big).
$$

This last term is bounded above by $C_1 \log n\,(n^{-C_2} + D/n)$. This concludes the proof.
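
To get a feeling for how slowly this Bonferroni-type bound decays, it can be evaluated numerically. The sketch below is only an illustration: the function name `type_one_error_bound` is hypothetical, and the values chosen for $C_1$, $C_2$ and in particular for the unknown constant $D$ are arbitrary assumptions.

```python
import math


def type_one_error_bound(n: int, c1: float, c2: float, d: float) -> float:
    """Evaluate C1 * log(n) * (n**(-C2) + D / n)."""
    return c1 * math.log(n) * (n ** (-c2) + d / n)


# arbitrary illustrative constants C1 = C2 = D = 1
for n in (10**2, 10**4, 10**6):
    print(n, type_one_error_bound(n, 1.0, 1.0, 1.0))
```

With these constants the bound behaves like $2 \log n / n$, so it vanishes as $n$ grows, but only at a rate slightly slower than $1/n$.
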

Theorem 4.4 not only proves the consistency of the algorithm Context, but it also gives an
upper bound for the rate of convergence. The estimation of the rate of convergence is crucial
because it gives a bound on the minimum size of a sample required to guarantee, with a given
probability, that the estimated tree is the correct one. This is the issue we address in the next
section.

5 Rate of convergence of the algorithm Context 

Note that Rissanen's original Theorem 3.1 as well as Theorem 4.4 only show that all the contexts
identified are true contexts with high probability. In other words, the estimated tree is a subtree
of the true tree with high probability. In the case of bounded probabilistic context trees this
missing point was handled in Weinberger et al. (1995). This paper not only proves that
the set of all contexts is reached, but also gives a bound for the rate of convergence.
More precisely, let us define the empirical tree

$$
\hat\tau_n = \left\{X_{j - \hat\ell_j(X_0^{j-1})}^{j-1} : j = n/2, \ldots, n\right\}. \tag{5.22}
$$

Actually, this is a slightly simplified version of the empirical tree defined in Weinberger et al.
(1995). In particular, we are neglecting all the computational aspects considered there. But
from the mathematical point of view, this definition perfectly does the job. Their convergence
result is the following.

Theorem 5.1 Let $(\tau, p)$ be a bounded probabilistic context tree and let $X_0, \ldots, X_n$ be
compatible with $(\tau, p)$. Then we have

$$
\sum_{n \ge 1} P(\hat\tau_n \ne \tau) \log n < +\infty.
$$

In the unbounded case, this issue was treated without estimation of the rate of convergence 
in Ferrari and Wyner (2003) and including estimation of the rate of convergence in Galves and 
Leonardi (2008). 

This last paper considers another slightly modified version of the algorithm Context using a 
different gain function, which has been introduced in Galves et al. (2007). More precisely, let 
us define for any finite string $x_{-k}^{-1} \in A^*_+$ the gain function

$$
\Delta_n(x_{-k}^{-1}) = \max_{a \in A} \left|\hat p_n(a \mid x_{-k}^{-1}) - \hat p_n(a \mid x_{-(k-1)}^{-1})\right|.
$$

This gain function is well adapted to use exponential inequalities for the empirical transition
probabilities in the pruning procedure rather than the chi-square approximation of the
log-likelihood ratio as in Theorems 3.1 and 4.4.
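
This second gain function only compares two empirical transition laws, so it is simple to compute. The sketch below is an illustration under stated assumptions: the function names are hypothetical, strings are Python `str` values with the oldest symbol first (so dropping the oldest symbol is `w[1:]`), and unseen strings receive the uniform law as in the convention adopted for (3.6).

```python
def count(sample: str, w: str) -> int:
    """Number of (possibly overlapping) occurrences of w in the sample."""
    j = len(w)
    return sum(1 for t in range(len(sample) - j + 1) if sample[t:t + j] == w)


def p_hat(sample: str, alphabet: str, w: str, a: str) -> float:
    """Empirical probability of a after w, uniform when w is never followed by a symbol."""
    total = sum(count(sample, w + b) for b in alphabet)
    return count(sample, w + a) / total if total > 0 else 1.0 / len(alphabet)


def delta_gain(sample: str, alphabet: str, w: str) -> float:
    """Largest change in empirical transition probability when the oldest symbol of w is dropped."""
    return max(abs(p_hat(sample, alphabet, w, a) - p_hat(sample, alphabet, w[1:], a))
               for a in alphabet)


sample = "0011" * 3
d_needed = delta_gain(sample, "01", "00")     # the second symbol changes the prediction
d_useless = delta_gain(sample, "01", "100")   # the third symbol changes nothing
```

A large value signals that the oldest symbol of the string carries information about the next symbol, while a value near zero suggests that the branch can be pruned.
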



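The theorem is stated in the following framework. Consider a stationary chain $(X_n)_{n \in \mathbb{Z}}$
compatible with an unbounded probabilistic context tree $(\tau, p)$. For this chain, we define the
sequence $(\alpha_n)_{n \in \mathbb{N}}$ by

$$
\alpha_0 = \sum_{a \in A} \inf_{\underline{x} \in \tau} p(a \mid \underline{x}),
$$

$$
\alpha_n = \inf_{x_{-n}^{-1}} \sum_{a \in A} \ \inf_{\underline{z} \in \tau:\ \ell(\underline{z}) \ge n,\ \underline{z} \overset{n}{=} x_{-n}^{-1}} p(a \mid \underline{z}).
$$

We assume that the probabilistic context tree $(\tau, p)$ satisfies the condition (4.13) of weak
non-nullness, that is $\alpha_0 > 0$. We assume also the following summability condition

$$
\alpha = \sum_{n \ge 0} (1 - \alpha_n) < +\infty. \tag{5.23}
$$

Given a sequence $x_1^j = (x_1, \ldots, x_j) \in A^j$ we denote by

$$
p(x_1^j) = P(X_1^j = x_1^j).
$$

Then for an integer $m \ge 1$, we define

$$
D_m = \min_{x_{-k}^{-1} \in \tau:\ k \le m} \ \max_{a \in A} \left\{\left|p(a \mid x_{-k}^{-1}) - p(a \mid x_{-(k-1)}^{-1})\right|\right\}, \tag{5.24}
$$

and

$$
\epsilon_m = \min\left\{p(x_{-k}^{-1}) : k \le m \text{ and } p(x_{-k}^{-1}) > 0\right\}. \tag{5.25}
$$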

Intuitively, $D_m$ tells us how distinctive the difference is between transition probabilities
associated to the exact contexts and those associated to a string shorter by one step in the past. We do
not want to impose restrictions on the transition probabilities elsewhere than at the end of the
branches of the context tree. This has to do with the pruning procedure which goes from the
leaves to the root of the tree.

In the unbounded case, a natural way to state the convergence results is to consider truncated
trees. The definition is the following. Given an integer $K$ we will denote by $\tau|_K$ the tree $\tau$
truncated to level $K$, that is

$$
\tau|_K = \{x_{-k}^{-1} \in \tau : k \le K\} \cup \{x_{-K}^{-1} \text{ such that } x_{-k}^{-1} \in \tau \text{ for some } k \ge K\}.
$$

Actually, this is exactly the same tree which was called $\tau^{[K]}$ in (4.17). The notation $\tau|_K$ is more
suitable for what follows.

The associated empirical tree of height $k$ is defined in the following way.

Definition 5.2 Given $\delta > 0$ and $k < n$, the empirical tree is defined as

$$
\hat\tau_n^{\delta,k} = \left\{x_{-r}^{-1},\ 1 \le r \le k : \Delta_n(x_{-r}^{-1}) > \delta \ \wedge\ \Delta_n(y_{-(r+j)}^{-(r+1)} x_{-r}^{-1}) \le \delta,\ \forall\, y_{-(r+j)}^{-(r+1)},\ 1 \le j \le k - r\right\}.
$$

In case $r = k$, the string $y_{-(r+j)}^{-(r+1)}$ is empty.

Note that in this definition, the parameter $\delta$ expresses the coarseness of the pruning criterion
and $k$ is the maximal length of the estimated contexts.
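
A brute-force version of this empirical tree can be written for small alphabets and small $k$. The sketch below is a hedged illustration, not the estimator of Galves and Leonardi (2008) as implemented by its authors: all names are hypothetical, strings are Python `str` values with the oldest symbol first, and, as an extra assumption of this sketch, strings that are never followed by a symbol in the sample are ignored instead of being assigned the uniform law.

```python
from itertools import product


def count(sample, w):
    """Number of (possibly overlapping) occurrences of w in the sample."""
    j = len(w)
    return sum(1 for t in range(len(sample) - j + 1) if sample[t:t + j] == w)


def seen(sample, alphabet, w):
    """True if w is followed by at least one symbol somewhere in the sample."""
    return sum(count(sample, w + b) for b in alphabet) > 0


def p_hat(sample, alphabet, w, a):
    """Empirical probability of a after w, uniform when w is never followed by a symbol."""
    total = sum(count(sample, w + b) for b in alphabet)
    return count(sample, w + a) / total if total > 0 else 1.0 / len(alphabet)


def delta_gain(sample, alphabet, w):
    """Largest change in the empirical law when the oldest symbol of w is dropped."""
    return max(abs(p_hat(sample, alphabet, w, a) - p_hat(sample, alphabet, w[1:], a))
               for a in alphabet)


def empirical_tree(sample, alphabet, delta, k):
    """Strings of length <= k whose gain exceeds delta while no seen extension does."""
    tree = set()
    for r in range(1, k + 1):
        for w in map("".join, product(alphabet, repeat=r)):
            if not seen(sample, alphabet, w) or delta_gain(sample, alphabet, w) <= delta:
                continue
            extensions = ("".join(u) + w
                          for j in range(1, k - r + 1)
                          for u in product(alphabet, repeat=j))
            if all(delta_gain(sample, alphabet, e) <= delta
                   for e in extensions if seen(sample, alphabet, e)):
                tree.add(w)
    return tree


tree = empirical_tree("0011" * 50, "01", delta=0.2, k=3)
```

On this periodic sample the sketch returns the four strings of length 2, in agreement with the fact that two past symbols are needed to predict the next one. The enumeration is exponential in $k$, so it is meant only for small examples.
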

Now, Galves and Leonardi (2008) obtain the following result on the rate of convergence for 
the truncated context tree. 



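Theorem 5.3 Let $(\tau, p)$ be a probabilistic context tree satisfying (4.13) and (5.23). Let
$X_0, \ldots, X_n$ be a stationary stochastic chain compatible with $(\tau, p)$. Then for any integer $K$, any
$k$ satisfying

$$
k > \max_{\underline{x} \in \tau|_K} \min\{\ell(\underline{y}) : \underline{y} \in \tau,\ \underline{x} \preceq \underline{y}\}, \tag{5.26}
$$

where $\underline{x} \preceq \underline{y}$ means that $\underline{x}$ is a suffix of $\underline{y}$, for any $\delta < D_k$ and for each

$$
n > \frac{2(|A| + 1)}{\min(\delta, D_k - \delta)\, \epsilon_k} + k \tag{5.27}
$$

we have that

$$
P\big(\hat\tau_n^{\delta,k}|_K \ne \tau|_K\big) \le 4\, e^{1/e}\, |A|^{k+2} \exp\left[-(n-k)\, \frac{\left[\min\left(\frac{\delta}{2}, \frac{D_k - \delta}{2}\right) - \frac{|A|+1}{(n-k)\epsilon_k}\right]^2 \epsilon_k^2\, C}{4 |A|^2 (k+1)}\right],
$$

where

$$
C = \frac{\alpha_0}{8e(\alpha + \alpha_0)}.
$$

In this theorem, the empirical trees have to be of height $k > K$ for the following reason.
Truncating $\tau$ at level $K$ implies that contexts longer than $K$ are cut before reaching their end,
and associated transition probabilities might not differ when comparing them at length $K$ and
$K - 1$. That is why we consider the bigger empirical tree of height $k$ satisfying condition (5.26).
This guarantees that for each element $\underline{x}$ of the truncated empirical tree there is at least one real
context $\underline{y}$ which has $\underline{x}$ as its suffix.

As a consequence of Theorem 5.3, Galves and Leonardi (2008) obtain the following strong
consistency result.

Corollary 5.4 Let $(\tau, p)$ be a probabilistic context tree satisfying the conditions of Theorem
5.3. Then

$$
\hat\tau_n^{\delta,k}|_K = \tau|_K, \tag{5.28}
$$

eventually almost surely, as $n \to \infty$.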

The main ingredient of the proof of Theorem 5.3 is an exponential upper bound for the
deviations of the empirical transition probabilities. More precisely, Galves and Leonardi (2008)
prove the following result.

Theorem 5.5 For any finite sequence $x_{-k}^{-1}$ with $p(x_{-k}^{-1}) > 0$, any symbol $a \in A$, any $t > 0$
and any $n > \frac{|A|+1}{t\, p(x_{-k}^{-1})} + k$ the following inequality holds

$$
P\big(|\hat p_n(a \mid x_{-k}^{-1}) - p(a \mid x_{-k}^{-1})| > t\big) \le 2 |A|\, e^{1/e} \exp\left[-(n-k)\, \frac{\left[t - \frac{|A|+1}{(n-k)\, p(x_{-k}^{-1})}\right]^2 p(x_{-k}^{-1})^2\, C}{4 |A|^2 (k+1)}\right], \tag{5.29}
$$

where

$$
C = \frac{\alpha_0}{8e(\alpha + \alpha_0)}. \tag{5.30}
$$

The proof of this theorem is inspired by recent exponential upper bounds obtained by
Dedecker and Doukhan (2003), Dedecker and Prieur (2005) and Maume-Deschamps (2006).
It is based on the following loss-of-memory inequality of Comets et al. (2002).

Theorem 5.6 Let $(X_n)_{n \in \mathbb{Z}}$ be a stationary stochastic chain compatible with the probabilistic
context tree $(\tau, p)$ of Theorem 5.3. Then, there exists a sequence $\{\rho_i\}_{i \in \mathbb{N}}$ such that for any $i \ge 1$,
any $k > i$, any $j \ge 1$ and any finite sequence $x_1^j$, the following inequality holds

$$
\sup_{y_1^i \in A^i} \left|P\big(X_k^{k+j-1} = x_1^j \mid X_1^i = y_1^i\big) - p(x_1^j)\right| \le j\, \rho_{k-i-1}. \tag{5.31}
$$

Moreover, the sequence $\{\rho_i\}_{i \in \mathbb{N}}$ is summable and

$$
\sum_{i \in \mathbb{N}} \rho_i \le 1 + \frac{2\alpha}{\alpha_0}.
$$

Theorem 5.3 generalizes to the unbounded case previous results in Galves et al. (2008)
for the case of bounded context trees. Note that the definition of the context tree estimator
depends on the parameter $\delta$, the same appearing in the constants of the exponential bound. To
ensure the consistency of the estimator we have to choose a $\delta$ sufficiently small, depending on
the true probabilities of the process. The same thing happens to the parameter $k$. Therefore,
this estimator is not universal, meaning that for fixed $\delta$ and $k$ it fails to be consistent for all
variable memory processes for which conditions (5.26) and (5.27) are not satisfied. We could
try to overcome this difficulty by letting $\delta = \delta(n) \to 0$ and $k = k(n) \to +\infty$ as $n$ increases. But
doing this, we lose the exponential character of the upper bound. This could be considered
as an illustration of the result in Finesso et al. (1996) who proved that in the simpler case of
estimating the order of a Markov chain, it is not possible to have a universal estimator with
exponential bounds for the probability of overestimation.

6 Some final comments and bibliographic remarks 

Chains with memory of variable length were introduced in the information theory literature 
by Rissanen (1983) as a universal system for data compression. Originally called by Rissanen 
tree machine, tree source, context models, etc., this class of models recently became popular in 
the statistics literature under the name of Variable Length Markov Chains (VLMC), coined by 
Bühlmann and Wyner (1999).

Rissanen (1983) not only introduced the notion of variable memory models but he also 
introduced the algorithm Context to estimate the probabilistic context tree. From Rissanen 
(1983) to Galves et al. (2008), by way of Ron et al. (1996) and Bühlmann and Wyner (1999),
several variants of the algorithm Context have been presented in the literature. In all the variants 
the decision to prune a branch is taken by considering a gain function. 

Rissanen (1983), Bühlmann and Wyner (1999) and Duarte et al. (2006) all defined the
gain function in terms of the log likelihood ratio function. Rissanen (1983) proved the weak
consistency of the algorithm Context in the case where the contexts have a bounded length.
Bühlmann and Wyner (1999) proved the weak consistency of the algorithm, also in the finite
case, without assuming a known prior bound on the maximal length of the memory but using a
bound allowed to grow with the size of the sample.
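
As a rough illustration, a gain function of log likelihood ratio type can be sketched in Python as follows. This is only a sketch under our own assumptions, not the exact definition used in any of the papers cited above: the function names `transition_counts` and `log_likelihood_gain` are hypothetical, the sample is a Python string in which the most recent symbol is written last, and the normalization and the pruning threshold differ from one paper to another.

```python
from collections import Counter
from math import log


def transition_counts(sample, max_len):
    """Count how often each word w (length <= max_len) is followed by symbol a.

    Returns a Counter indexed by pairs (w, a), where w is the string of
    symbols immediately preceding a in the sample.
    """
    counts = Counter()
    for t in range(len(sample)):
        for length in range(min(t, max_len) + 1):
            counts[(sample[t - length:t], sample[t])] += 1
    return counts


def log_likelihood_gain(counts, w, alphabet):
    """Log likelihood ratio of the one-step extensions bw against w.

    Computes sum over b and a of N(bw, a) * log(p_hat(a | bw) / p_hat(a | w)),
    where N and p_hat are empirical counts and transition probabilities.
    A small value suggests that the extra symbol b brings no information.
    """
    n_w = sum(counts[(w, a)] for a in alphabet)
    gain = 0.0
    for b in alphabet:
        bw = b + w  # extend the past by one older symbol
        n_bw = sum(counts[(bw, a)] for a in alphabet)
        for a in alphabet:
            n_bwa = counts[(bw, a)]
            if n_bwa == 0:
                continue  # zero counts contribute nothing to the sum
            p_child = n_bwa / n_bw
            p_parent = counts[(w, a)] / n_w
            gain += n_bwa * log(p_child / p_parent)
    return gain


# Toy sample: a strictly alternating binary string.
counts = transition_counts("01" * 50, max_len=2)
print(log_likelihood_gain(counts, "", "01"))   # gain of remembering one symbol
print(log_likelihood_gain(counts, "0", "01"))  # gain of remembering two symbols
```

On this alternating toy sample, the last symbol determines the next one, so the first gain is large while the second one vanishes: looking further back than one symbol brings nothing, and the corresponding branches would be pruned.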

A different gain function was introduced in Galves et al. (2008), considering differences
between successive empirical transition probabilities and comparing them with a given threshold
$\delta$. An interesting consequence of the use of this different gain function was obtained by Collet
et al. (2007). They proved that in the case of a binary alphabet and when taking $\delta$ within a
suitable interval, it is possible to recover the context tree in the bounded case from a noisy
sample where each symbol can be flipped with small probability independently of the others.
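
The threshold-based gain function can be sketched in the same spirit. The code below is a minimal illustration under our own assumptions: the names `empirical_transitions`, `p_hat`, `delta_gain` and `keep_branch` are hypothetical, the sample is a Python string with the most recent symbol written last, and the gain of a string w is taken to be the largest difference, over the symbols a, between the empirical transition probabilities given w and given w without its oldest symbol. The full estimator involves additional conditions that are not reproduced here.

```python
from collections import Counter


def empirical_transitions(sample, max_len):
    """Count how often each word w (length <= max_len) is followed by symbol a."""
    counts = Counter()
    for t in range(len(sample)):
        for length in range(min(t, max_len) + 1):
            counts[(sample[t - length:t], sample[t])] += 1
    return counts


def p_hat(counts, w, a, alphabet):
    """Empirical probability of a after w, or None if w was never followed by a symbol."""
    n_w = sum(counts[(w, b)] for b in alphabet)
    return counts[(w, a)] / n_w if n_w > 0 else None


def delta_gain(counts, w, alphabet):
    """Largest difference between transition probabilities given w and given suf(w).

    w must be non-empty; suf(w) is w without its leftmost (oldest) symbol.
    Returns None when one of the empirical probabilities is undefined.
    """
    suffix = w[1:]
    diffs = []
    for a in alphabet:
        p_w = p_hat(counts, w, a, alphabet)
        p_s = p_hat(counts, suffix, a, alphabet)
        if p_w is None or p_s is None:
            return None
        diffs.append(abs(p_w - p_s))
    return max(diffs)


def keep_branch(counts, w, alphabet, delta):
    """Keep the branch w only if its gain exceeds the threshold delta."""
    gain = delta_gain(counts, w, alphabet)
    return gain is not None and gain > delta


# Toy sample: a strictly alternating binary string, threshold chosen arbitrarily.
counts = empirical_transitions("01" * 50, max_len=2)
for w in ["0", "1", "10", "01"]:
    print(w, delta_gain(counts, w, "01"), keep_branch(counts, w, "01", delta=0.1))
```

On this toy sample the strings of length one are kept and the strings of length two are pruned, which matches the fact that the alternating sequence has memory of length one. The choice of the threshold is the delicate point: as discussed above, it has to be small enough with respect to the true transition probabilities.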



The case of unbounded probabilistic context trees was, as far as we know, first considered by
Ferrari and Wyner (2003), who also proved a weak consistency result for the algorithm Context in
this more general setting. The unbounded case was also considered by Csiszár and Talata (2006)
who introduced a different approach for the estimation of the probabilistic context tree using 
the Bayesian Information Criterion (BIC) as well as the Minimum Description Length Principle 
(MDL). We refer the reader to this last paper for a nice description of other approaches and 
results in this field, including the context tree maximizing algorithm by Willems et al. (1995). 
We also refer the reader to Garivier (2006a, b) for recent and elegant results on the BIC and 
the Context Tree Weighting Method (CTW). Garivier (2006c) is a very good presentation of 
models having memory of variable length, BIC, MDL, CTW and related issues in the framework 
of information theory. 

With the exception of Weinberger et al. (1995), the issue of the rate of convergence of the
algorithm estimating the probabilistic context tree was not addressed in the literature until 
recently. Weinberger et al. (1995) proved in the bounded case that the probability that the 
estimated tree differs from the finite context tree is summable as a function of the sample size. 
Assuming weaker hypotheses than Ferrari and Wyner (2003), Duarte et al. (2006) proved in the 
unbounded case that the probability of error decreases as the inverse of the sample size. 

Leonardi (2007) obtained an upper bound for the rate of convergence of penalized likelihood
context tree estimators. This work showed that the estimated context tree truncated at any fixed height
approximates the real truncated tree at a rate that decreases faster than the inverse of an
exponential function of the penalizing term. The proof mixes the approaches of Galves et al.
(2008) and Csiszár and Talata (2006).

Several interesting papers have recently addressed the question of classification of proteins 
and DNA sequences using models with memory of variable length, which in bio-informatics are 
often called prediction suffix trees (PST). Many of these papers have been written from a
bio-informatics point of view, focusing on the development of new tools rather than being concerned
with mathematically rigorous proofs. The interested reader can find a starting point to this
literature for instance in the papers by Bejerano et al. (2001), Bejerano and Yona (2001), Eskin
et al. (2000), Leonardi (2006) and Miele et al. (2005). The same type of analysis has been
applied successfully to classification tasks in other domains like musicology (Lartillot et al. 2003),
linguistics (Selding et al. 2001), etc.

This presentation did not intend to be exhaustive and the bibliography in many cases only 
gives a few hints about possible starting points to the literature. However, we think we have 
presented the state of the art concerning the rate of convergence of context tree estimators. 

In the introduction we said that Rissanen's ingenious idea was to construct a stochastic 
model that generalizes the notion of relevant domain (in biology or linguistics) to any kind of 
symbolic strings. Actually, God only knows what Jorma had in mind when he invented this 
class of models. The French poet Paul Éluard wrote a book called Les frères voyants. This was
the name given in the Middle Ages to people who guided the blind. So maybe Rissanen acted
as a frère voyant, using his intuition to push mathematics and statistics in a challenging new
direction.